A Rule-Based Data Standardizer for Enterprise Data Bases

Authors

  • Abhik Roychoudhury
  • I. V. Ramakrishnan
  • Terrance Swift
Abstract

Whenever a database permits textual entry of information (for example, when data is copied from a paper form), the database is likely to contain duplicates and inconsistencies. These duplicates must be removed and inconsistencies resolved in order to mine the data or to use the data for decision support. We term the domain-specific solution to duplicate and inconsistency removal data standardization. In this paper, we describe a Name-Address Standardizer, one of a series of standardizers that have proven critical in creating a new enterprise-level database for the U.S. Customs Service. The standardizers were used to clean several legacy databases. These standardized databases were combined into a central database for which data is now standardized upon input. In practice, a standardizer uses techniques both from natural language analysis and from rule-based expert systems. As a result, Prolog is highly suitable as a basis for standardizers. All Customs standardizers were written almost entirely in Prolog and constitute a large programming effort: the Name-Address Standardizer contains about 100,000 lines of code, including generated parse tables and a fact base.
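To make the idea of data standardization concrete, here is a minimal sketch in Python of how rewriting free-text entries to a canonical form can expose duplicates. It is not the authors' implementation, which is written in Prolog and relies on generated parse tables and a large fact base. The table `CANONICAL_TOKENS` and the functions `standardize` and `deduplicate` are hypothetical names, and the table entries are illustrative rather than taken from the paper.

```python
import re

# Hypothetical fact base: maps variant tokens to one canonical form.
# Entries are illustrative only and do not come from the paper.
CANONICAL_TOKENS = {
    "ST": "STREET",
    "STR": "STREET",
    "AVE": "AVENUE",
    "AV": "AVENUE",
    "RD": "ROAD",
    "CO": "COMPANY",
    "CORP": "CORPORATION",
    "INC": "INCORPORATED",
}


def standardize(text: str) -> str:
    """Return a canonical form of a free-text name or address line."""
    # Rule 1: uppercase the text and replace punctuation with spaces.
    cleaned = re.sub(r"[^A-Z0-9 ]", " ", text.upper())
    # Rule 2: rewrite each token through the fact base, if it is listed.
    tokens = [CANONICAL_TOKENS.get(tok, tok) for tok in cleaned.split()]
    return " ".join(tokens)


def deduplicate(records):
    """Group raw records that share the same standardized form."""
    groups = {}
    for record in records:
        groups.setdefault(standardize(record), []).append(record)
    return groups


if __name__ == "__main__":
    raw = ["Acme Corp., 123 Main St.", "ACME Corporation 123 main street"]
    for canonical, variants in deduplicate(raw).items():
        print(canonical, "<-", variants)
```

A token lookup like this cannot resolve context-dependent tokens: "ST" may stand for "STREET" or "SAINT" depending on its position in the address. Cases of that kind are where parsing and rule-based reasoning, the techniques the abstract names, become necessary.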

Publication date: 1997